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SP1 PROJECTION STORED 


The projection creates quick ON DISK 
| sales pathways to access sales data 
for each SKU for this sales 
SP1 | person. This enables Vertica to 
i CUR speedily aggregate totals, 
because the relevant data is The projection is ordered 
clustered together and can be by sales person so all the 


read sequentially. data for any sales person 
SKU 4 sales can be quickly located 
and read sequentially 
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OLTP INDEX WITH HIGH CARDINALITY 


Encoding 


READ 
Due to run-Length encoding (default), Vertica 


knows that there are 600K entries which have 
the FEMALE gender attribute. So a COUNT(*) 
F with a correlation on gender - 'F' is almost 
instantaneous, no sequential scan required. 
Since we are only sorting and ordering by 
gender, the other field attributes are 
sequentially encountered randomly. 


What is said above for the F gender applies here 

M also to the M gender. Due to RLE encoding, 
Vertica already knows that there are 400K records 
with gender = 'M'. All further field attributes are 
sequentially encountered randomly. 


OLTP vs OLAP = Fast Lookup vs Fast 
Aggregation 


Ф 
МуУ 


For OLTP databases, QP picks the best index(es) to 
use; QP can and will ignore indexes that provide no 
benefit such as indexes on low-cardinality data, 
since sequential scan may be just as fast. 


For OLAP databases, QP picks best columnar store 
to use; QP will default to base columnar store if a 
better one for the query does not exist. 


DBA CHALLENGE: Creating indexes and columnar 
stores optimized for fast lookup or aggregate 
queries. 


Sample 
Query: OLTP SELECT COUNT(*) FROM students 


vs OLAP WHERE gender = ‘F’ AND CLASS = 


OLTP Query 
Processing 


eindex does not contain 


any gender or class 
information. Thus, all 
data pages must be read 
and filtered. 


INDEX ON 
students.student id 


READ 


Now the data is sorted on class as well as 
gender. The data beyond gender is not 
random as it was before — it is now 
sequentially ordered by class as well. So as 
the data is read sequentially, Female 
Freshman are first clustered together, 
followed by Female Juniors, etc. Due to RLE, 
the total count of Freshmen who are female 
is already known. The same applies to all 
other classes, and to the "M” gender and its 
classes, too. However, any further field 
attributes are random (such as student id) 


Not an efficient design for 
aggregation and analytics because 
the database query planner will 


ignore indexes on low cardinality 
columns, resulting in sequential 
scans. This is why OLTP databases 
do not perform well in BI roles. 


Run-length encoding (RLE) is common for low cardinality 
columns. Example column is “gender” which is mostly 50/50 
distribution M/F. 


SELECT gender, count(*) FROM users GROUP BY 1; is a slow 
query on OLTP, because gender column cannot be effectively 


indexed (low cardinality). But very fast in OLAP due to ordered 
columnar storage and RLE. 


OLAP stores data in immutable “storage containers.” Any 
update or delete means that the storage container has to be 
destroyed and recreated. Very inefficient! 


Storage containers on 
disk are generally 
immutable. Any 

Update/Delete 
operation requires the 
storage container to 
be destroyed and 
rebuilt; very 
inefficient. 


MERGE data loading 
is inefficient. 


* Databases originally designed with 
update/delete capability to reduce storage 
costs which were very expensive in 1970s 


* Due to Moore's Law, storage costs have 
decreased dramatically 


* Analytic tools like Vertica have increased value 
of data 


* Change history now a critical aspect of data 
analytics 


* Use SQL techniques to generate “latest 
Snapshot” of table data 


* Preserves change history 
* Reduces storage container “churn” 


* Best if OLTP does “soft deletes” so that it is 
also essentially “append only” 
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Coding Top-K Projections 


AS SELECT select-expression FROM table 
IMIT num-rows OVER (PARTITION BY expression ORDER BY column-expr); 


For the latest record, K=1. Here is how we would create a Top-K projection for listings (Note 
have simplified the column list for clarity): 
CREATE PROJECTION listings latest 

AS SELECT listing id, update date, title FROM table 


LIMIT 1 OVER (PARTITION BY listing id ORDER BY update date 
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Wrapping Top-K 


РГО) ections with * Simplify queries by using views to wrap Top-K projections; removes 


the Top-K code 


Views 


CREATE VIEW v_listings latest AS 
SELECT listing id, update date, title 
FROM listings 
LIMIT 1 OVER (PARTITION BY listing id ORDER BY update date DESC); 


Now to get the latest state of all listings, we just need query the view, without the need to work in 
that LIMIT 1 OVER()... predicate: 


SELECT * FROM v listings latest; 
Of course, individual records can also be pulled from the view: 


SELECT * FROM v listings latest WHERE listing id - 12345; 


It is used to correlate changes over time to 
performance over time 


Because all data via ELT into MicroFocus EDW IS 
APPEND-only, by definition, the initial, pristine staging 
table contains all versions of records, and as such is 
the full Change History. We use the view wrapping the 
Top-K projection to get latest snapshot. 


Referenti 
al 
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Referential integrity is important for maintaining clean data. 
But where is the best place to do that? 


OLTP databases are the best place to specify and enforce 
referential integrity, due to efficient indexing of columns. 


OLAP databases essentially inherit the referential integrity 
enforced upstream in OLTP systems. No need to repeat! 


The inheritance creates “eventually consistent” referential 
integrity, akin to “eventually consistent” replication. 


MicroFocus EDW (Vertica) sources data via ELT from Oracle, 
SQLServer, SalesForce and clickstream and other logs. No 
primary or foreign keys are used, we rely on and trust 
upstream referential integrity enforcement. Works great! 


Benefits of Eschewing Referential Integrity Constraints 


in OLAP Databases 


Why redo in OLAP what has 
already been done upstream in 
OLTP? 


High velocity fact tables can be 
loaded more frequently than 
slowly changing dimension 


tables, without breaking things. 


Again, due to “Eventually 
Consistent Referential 
Integrity”. 


Makes ELT easier because the 
order of data arrival is not 
important: “Eventually 
Consistent Referential 
Integrity”. 


Only consideration is that we 

must query after all lag time is 
taken into consideration, such 
as end-of-quarter summations. 


Summary 
of Best 
Practices 
for OLTP 


and OLAP 


If possible, design OLTP data to feature soft deletes instead of hard 


Use append-only ELT data loading into OLAP database 


No need to use MERGE commands 


Use Top-K projections (or equivalent) to generate materialized 
views of snapshots of latest states of tables in OLAP database 


Design OLAP columnar stores (projections) with left-most columns 
having lowest cardinality (inverse of indexes in OLTP databases) 


Never use SQL “DELETE” command - causes too many delete 
vectors. Instead, use TRUNCATE to empty a table, or drop 
partitions 


Change History is a no-cost benefit; make analysts aware of this! 
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